Poster: Larry M Date: Jan 31, 2004 2:31pm
Forum: texts Subject: Texts in PDF

I was talking to Michael Lesk at the DIAL '04
conference, and it got me interested in the
Million Books project.

I took one of the books (Early Jazz) and converted
it to PDF using Acrobat 6, with JBIG2 "lossy"
compression on the TIF file. That turned the
88 MB .tif file into a 7.34 MB .pdf.

http://larry.masinter.net/earlyjazz-notext.pdf

If you OCR the pages so there is text as well
as images in the same file, you get 16.7 MB:

http://larry.masinter.net/earlyjazz.pdf

This compares to 33.9 MB for DjVu.
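
For a rough sense of scale, here is a small Python sketch that
only does the arithmetic on the sizes quoted above. The names
(SIZES_MB, compression_ratio, summarize) are made up for
illustration, and it assumes all the figures are in the same
units (MB).

```python
# File sizes in MB, as reported in this post (assumed to share one unit).
SIZES_MB = {
    "TIFF (source)": 88.0,
    "PDF, JBIG2, images only": 7.34,
    "PDF, JBIG2, images + OCR text": 16.7,
    "DjVu": 33.9,
}


def compression_ratio(original_mb, compressed_mb):
    """Return how many times smaller the compressed file is."""
    return original_mb / compressed_mb


def summarize(sizes, source_key="TIFF (source)"):
    """Map each non-source format to its ratio against the source, to 1 decimal."""
    source = sizes[source_key]
    return {
        name: round(compression_ratio(source, mb), 1)
        for name, mb in sizes.items()
        if name != source_key
    }


if __name__ == "__main__":
    for name, ratio in summarize(SIZES_MB).items():
        print(f"{name}: about {ratio}x smaller than the TIFF")
```

By this arithmetic the image-only PDF is roughly 12x smaller than
the TIFF, the PDF with OCR text about 5.3x, and the DjVu about 2.6x.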

I imagine these only work with the Acrobat 6
reader (http://www.adobe.com/readstep).
The "reduce file size" option using JBIG2 also
cleans up text images a bit, since it reduces noise.

Anyway, I'm interested in helping if someone wants
to pursue this; let me know.

Larry